 Marquette County


Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL

Li, Weizhen, Lin, Jianbo, Jiang, Zhuosong, Cao, Jingyi, Liu, Xinpeng, Zhang, Jiayu, Huang, Zhenqiang, Chen, Qianben, Sun, Weichen, Wang, Qiexiang, Lu, Hongxuan, Qin, Tianrui, Zhu, Chenghao, Yao, Yi, Fan, Shuying, Li, Xiaowan, Wang, Tiannan, Liu, Pai, Zhu, King, Zhu, He, Shi, Dingfeng, Wang, Piaohong, Guan, Yeyi, Tang, Xiangru, Liu, Minghao, Jiang, Yuchen Eleanor, Yang, Jian, Liu, Jiaheng, Zhang, Ge, Zhou, Wangchunshu

arXiv.org Artificial Intelligence

Recent advances in large language models (LLMs) and multi-agent systems have demonstrated remarkable capabilities in complex problem-solving tasks such as deep research, vibe coding, and mathematical reasoning. However, most existing multi-agent systems are built upon manual prompt/workflow engineering with sophisticated agent frameworks, making them computationally inefficient, less capable, and unable to benefit from data-centric learning. In this work, we introduce Chain-of-Agents (CoA), a novel paradigm of LLM reasoning that enables native end-to-end complex problem-solving in the same way as a multi-agent system (i.e., multi-turn problem solving with multiple tools and multiple agents) within one model. In chain-of-agents problem-solving, the model dynamically activates different tool agents and role-playing agents to simulate multi-agent collaboration in an end-to-end fashion. To elicit end-to-end chain-of-agents problem-solving abilities in LLMs, we introduce a multi-agent distillation framework to distill state-of-the-art multi-agent systems into chain-of-agents trajectories for agentic supervised fine-tuning. We then use agentic reinforcement learning on verifiable agentic tasks to further improve the models' capabilities on chain-of-agents problem solving. We call the resulting models Agent Foundation Models (AFMs). Our empirical studies demonstrate that AFM establishes new state-of-the-art performance across diverse benchmarks in both web agent and code agent settings. We fully open-source the entire research, including the model weights, code for training and evaluation, and the training data, which offers a solid starting point for future research on agent models and agentic RL.


Defending Against Knowledge Poisoning Attacks During Retrieval-Augmented Generation

Edemacu, Kennedy, Shashidhar, Vinay M., Tuape, Micheal, Abudu, Dan, Jang, Beakcheol, Kim, Jong Wook

arXiv.org Artificial Intelligence

Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to boost the capabilities of large language models (LLMs) by incorporating external, up-to-date knowledge sources. However, this introduces a potential vulnerability to knowledge poisoning attacks, where attackers can compromise the knowledge source to mislead the generation model. One such attack is PoisonedRAG, in which injected adversarial texts steer the model to generate an attacker-chosen response to a target question. In this work, we propose novel defense methods, FilterRAG and ML-FilterRAG, to mitigate the PoisonedRAG attack. First, we propose a new property that differentiates adversarial texts from clean texts in the knowledge data source. Next, we employ this property to filter out adversarial texts from clean ones in the design of our proposed approaches. Evaluation of these methods using benchmark datasets demonstrates their effectiveness, with performance close to that of the original RAG systems. A key challenge associated with large language models (LLMs) [1]-[3] is their tendency to become outdated and to struggle to integrate the most recent knowledge [4], [5]. This fundamental shortcoming is addressed by the recent emergence of retrieval-augmented generation (RAG) [6]-[9].


Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage

Gao, Zhi, Zhang, Bofei, Li, Pengxiang, Ma, Xiaojian, Yuan, Tao, Fan, Yue, Wu, Yuwei, Jia, Yunde, Zhu, Song-Chun, Li, Qing

arXiv.org Artificial Intelligence

[Figure: two example queries (buying a PS5 for each child in a photo; pricing an NVIDIA GeForce RTX 4070 SUPER for each man in a picture) comparing the tools chosen and the resulting answers. Caption: Our agent chooses more precise tools based on the given files and intermediate observations.]

The advancement of large language models (LLMs) prompts the development of multi-modal agents, which are used as a controller to call external tools, providing a feasible way to solve practical tasks. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for powerful tool-usage reasoning. To preserve the data quality, we prompt the GPT-4o mini model to generate queries, files, and trajectories, followed by query-file and trajectory verifiers. Based on the data synthesis pipeline, we collect the MM-Traj dataset that contains 20K tasks with trajectories of tool usage. Then, we develop the T3-Agent via Trajectory Tuning on VLMs for Tool usage using MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-Agent consistently achieves improvements on two popular VLMs: MiniCPM-V-8.5B

Integrating external tools to solve diverse multi-modal tasks is a promising research direction towards multi-modal agents (Surís et al., 2023; Gupta & Kembhavi, 2023; Gao et al., 2024; Yuan et al., 2024; Zhong et al., 2023). Existing agents usually use a large language model (LLM) as the controller that generates plans via prompt engineering to call tools, achieving impressive performance in multiple domains, such as image editing (Wu et al., 2023), robotic manipulation (ichter et al., 2023), question answering (Shen et al., 2024), video understanding (Fan et al., 2024), and desktop apps (Trivedi et al., 2024). Despite this success, prompt engineering provides only limited reasoning ability for tool usage when tackling practical tasks, as shown in Figure 1.


Woman, 101, is mistaken for a BABY because American Airlines' computer system can't accept that she was born in 1922 and not 2022 - as she jokes 'they thought I was a child and I'm an old lady!'

Daily Mail - Science & tech

A woman flying from Chicago to Marquette, Michigan was left baffled this week, after being mistaken for a baby. Patricia, 101, was boarding the flight with her daughter, Kris, when she was confronted by the cabin crew. Bizarrely, they had expected her to be aged one, due to an error with American Airlines' booking system. Patricia, who did not want her surname shared, was born in 1922, rather than 2022 - something the computer system could not seem to accept. Speaking to the BBC, who witnessed the mix-up, she said: 'It was funny that they thought I was only a little child and I'm an old lady!'


SparQ Attention: Bandwidth-Efficient LLM Inference

Ribar, Luka, Chelombiev, Ivan, Hudlass-Galley, Luke, Blake, Charlie, Luschi, Carlo, Orr, Douglas

arXiv.org Artificial Intelligence

Generative large language models (LLMs) have opened up numerous novel possibilities, but due to their significant computational requirements their ubiquitous use remains challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both significantly increasing the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show how SparQ Attention can decrease the attention memory bandwidth requirements by up to eight times without any loss in accuracy by evaluating Llama 2 and Pythia models on a wide range of downstream tasks.
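The abstract does not spell out how the cached history is selected, so the following is only a rough sketch of one way "selective fetching" could work, not the authors' algorithm. It assumes a single query vector, a key/value cache stored as Python lists, and two hypothetical parameters: `r` (how many query components are used to estimate scores) and `k` (how many cached positions are fully fetched).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_fetch_attention(q, K, V, r, k):
    """Attend over only k cached positions, chosen from cheap approximate scores.

    q: query vector (length d); K, V: cached keys and values (one row per position).
    r, k: hypothetical knobs for this sketch, not names taken from the paper.
    """
    d = len(q)
    # Step 1: pick the r query components with the largest magnitude.
    dims = sorted(range(d), key=lambda i: -abs(q[i]))[:r]
    # Step 2: estimate scores by reading only those r components of each key.
    approx = [sum(q[i] * key[i] for i in dims) for key in K]
    # Step 3: keep the k positions with the highest estimated score.
    top = sorted(sorted(range(len(K)), key=lambda t: -approx[t])[:k])
    # Step 4: read full keys and values for the selected positions only.
    scores = [sum(qi * ki for qi, ki in zip(q, K[t])) / math.sqrt(d) for t in top]
    weights = softmax(scores)
    out = [sum(w * V[t][c] for w, t in zip(weights, top)) for c in range(len(V[0]))]
    return out, top

# Toy cache with four positions and two-dimensional keys and values.
q = [2.0, 0.1]
K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
out, top = sparse_fetch_attention(q, K, V, r=1, k=2)
```

In this toy setup the sketch reads `r` components of every key plus the full key and value for `k` positions, instead of every full key and value; whether that trade preserves accuracy in practice is exactly what the paper evaluates.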


Stable Voting

Holliday, Wesley H., Pacuit, Eric

arXiv.org Artificial Intelligence

We propose a new single-winner voting system using ranked ballots: Stable Voting. The motivating principle of Stable Voting is that if a candidate A would win without another candidate B in the election, and A beats B in a head-to-head majority comparison, then A should still win in the election with B included (unless there is another candidate A' who has the same kind of claim to winning, in which case a tiebreaker may choose between such candidates). We call this principle Stability for Winners (with Tiebreaking). Stable Voting satisfies this principle while also having a remarkable ability to avoid tied outcomes in elections even with small numbers of voters.
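The premise of Stability for Winners can be made concrete with a small sketch. The code below is not the Stable Voting procedure itself, which the abstract does not describe; it only computes head-to-head majority margins from ranked ballots and lists the pairs (A, B) for which A would win without B and A beats B head-to-head. The stand-in rule `plurality_winners` and the ballot format (a ranking tuple plus a voter count) are assumptions made for illustration.

```python
from itertools import permutations

def margin(ballots, a, b):
    """Voters ranking a above b, minus voters ranking b above a."""
    m = 0
    for ranking, count in ballots:
        m += count if ranking.index(a) < ranking.index(b) else -count
    return m

def plurality_winners(ballots, candidates):
    """Stand-in rule (an assumption): most first-place votes among `candidates`."""
    tally = {c: 0 for c in candidates}
    for ranking, count in ballots:
        # Each ballot counts for its highest-ranked remaining candidate.
        top = next(c for c in ranking if c in candidates)
        tally[top] += count
    best = max(tally.values())
    return {c for c, votes in tally.items() if votes == best}

def stability_claims(ballots, candidates, rule=plurality_winners):
    """Pairs (a, b) where a wins under `rule` without b, and a beats b head-to-head."""
    claims = []
    for a, b in permutations(candidates, 2):
        without_b = [c for c in candidates if c != b]
        if a in rule(ballots, without_b) and margin(ballots, a, b) > 0:
            claims.append((a, b))
    return claims

# Nine voters; each entry is (ranking from best to worst, number of voters).
ballots = [(("A", "B", "C"), 4), (("B", "A", "C"), 3), (("C", "A", "B"), 2)]
claims = stability_claims(ballots, ["A", "B", "C"])
```

With these ballots, A beats both B and C head-to-head and also wins under the stand-in rule when either is removed, so A is the only candidate with a stability claim. The principle then says A should still win with all three candidates in the election.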